AITopics | vh 1

Collaborating Authors

vh 1

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Regret Bounds for Kernel-Based Reinforcement Learning

Domingues, Omar Darwiche, Ménard, Pierre, Pirotta, Matteo, Kaufmann, Emilie, Valko, Michal

arXiv.org Machine LearningApr-12-2020

We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. Unlike existing approaches with regret guarantees, it does not use any kind of partitioning of the state-action space. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $O\left( H^3 K^{\max\left(\frac{1}{2}, \frac{2d}{2d+1}\right)}\right)$, where $d$ is the covering dimension of the joint state-action space. We empirically validate Kernel-UCBVI on discrete and continuous MDPs.

algorithm, artificial intelligence, upstream oil & gas, (20 more...)

arXiv.org Machine Learning

2004.05599

Country:

Europe > Hungary (0.14)
North America > United States (0.14)

Genre: Research Report (0.40)

Industry: Energy > Oil & Gas > Upstream (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)

Add feedback

Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces

Touati, Ahmed, Taiga, Adrien Ali, Bellemare, Marc G.

arXiv.org Machine LearningMar-9-2020

Despite the wealth of research into provably efficient reinforcement learning algorithms, most works focus on tabular representation and thus struggle to handle exponentially or infinitely large state-action spaces. In this paper, we consider episodic reinforcement learning with a continuous state-action space which is assumed to be equipped with a natural metric that characterizes the proximity between different states and actions. We propose ZoomRL, an online algorithm that leverages ideas from continuous bandits to learn an adaptive discretization of the joint space by zooming in more promising and frequently visited regions while carefully balancing the exploitation-exploration trade-off. We show that ZoomRL achieves a worst-case regret $\tilde{O}(H^{\frac{5}{2}} K^{\frac{d+1}{d+2}})$ where $H$ is the planning horizon, $K$ is the number of episodes and $d$ is the covering dimension of the space with respect to the metric. Moreover, our algorithm enjoys improved metric-dependent guarantees that reflect the geometry of the underlying space. Finally, we show that our algorithm is robust to small misspecification errors.

algorithm, inequality follow, rad, (16 more...)

arXiv.org Machine Learning

2003.04069

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.41)

Add feedback

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Duan, Yaqi, Wang, Mengdi

arXiv.org Machine LearningFeb-21-2020

This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history generated by unknown behavioral policies. We study a regression-based fitted Q iteration method, and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound. The policy evaluation error depends sharply on a restricted $\chi^2$-divergence over the function class between the long-term distribution of the target policy and the distribution of past data. This restricted $\chi^2$-divergence is both instance-dependent and function-class-dependent. It characterizes the statistical limit of off-policy evaluation. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.

inequality, probability, vh 1, (13 more...)

arXiv.org Machine Learning

2002.09516

Country: North America > United States > Massachusetts > Middlesex County > Belmont (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.61)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback

Optimistic Policy Optimization with Bandit Feedback

Efroni, Yonathan, Shani, Lior, Rosenberg, Aviv, Mannor, Shie

arXiv.org Machine LearningFeb-19-2020

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of exploration, or by making strong assumptions on the interaction with the environment. In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. For this setting, we propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde O(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards. Furthermore, we prove $\tilde O( \sqrt{ S^2 A H^4 } K^{2/3} ) $ regret for adversarial rewards. Interestingly, this result matches previous bounds derived for the bandit feedback case, yet with known transitions. To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.

algorithm, good event, optimistic policy optimization, (13 more...)

arXiv.org Machine Learning

2002.08243

Country:

Asia > Middle East > Jordan (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(2 more...)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback

Worst-Case Regret Bounds for Exploration via Randomized Value Functions

Russo, Daniel

arXiv.org Artificial IntelligenceJun-6-2019

Exploration is one of the central challenges in reinforcement learning (RL). A large theoretical literature treats exploration in simple finite state and action MDPs, showing that it is possible to efficiently learn a near optimal policy through interaction alone [5, 8, 10, 11, 13-16, 24, 25]. Overwhelmingly, this literature focuses on optimistic algorithms, with most algorithms explicitly maintaining uncertainty sets that are likely to contain the true MDP. It has been difficult to adapt these exploration algorithms to the more complex problems investigated in the applied RL literature. Most applied papers seem to generate exploration through ǫ-greedy or Boltzmann exploration. Those simple methods are compatible with practical value function learning algorithms, which use parametric approximations to value functions to generalize across high dimensional state spaces. Unfortunately, such exploration algorithms can fail catastrophically in simple finite state MDPs [See e.g.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

1906.0287

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)

Add feedback

Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

Simchowitz, Max, Jamieson, Kevin

arXiv.org Machine LearningMay-9-2019

Reinforcement learning (RL) is a powerful paradigm for modeling a learning agent's interactions with an unknown environment, in an attempt to accumulate as much reward as possible. Because of its flexibility, RL can encode such a vast array of different problem settings - many of which are entirely intractable. Therefore, it is crucial to understand what conditions make it possible for an RL agent to effectively learn about its environment. In this paper, we consider tabular Markov decision processes (MDPs), a canonical RL setting where the agent seeks to learn a policy mapping discrete states x S to one of finitely many actions a A, in attempt to maximize cumulative reward over an episode horizon H. We shall study the regret setting, where the learner plays a policy π for a sequence of episodes k 1, . . .

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Machine Learning

1905.03814

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.66)

Add feedback

Efficient Model-free Reinforcement Learning in Metric Spaces

Song, Zhao, Sun, Wen

arXiv.org Machine LearningMay-1-2019

Model-free Reinforcement Learning (RL) algorithms such as Q-learning [Watkins, Dayan 92] have been widely used in practice and can achieve human level performance in applications such as video games [Mnih et al. 15]. Recently, equipped with the idea of optimism in the face of uncertainty, Q-learning algorithms [Jin, Allen-Zhu, Bubeck, Jordan 18] can be proven to be sample efficient for discrete tabular Markov Decision Processes (MDPs) which have finite number of states and actions. In this work, we present an efficient model-free Q-learning based algorithm in MDPs with a natural metric on the state-action space--hence extending efficient model-free Q-learning algorithms to continuous state-action space. Compared to previous model-based RL algorithms for metric spaces [Kakade, Kearns, Langford 03], our algorithm does not require access to a black-box planning oracle.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Machine Learning

1905.00475

Country: Asia > Middle East > Jordan (0.24)

Genre: Research Report (0.40)

Industry:

Transportation (0.34)
Leisure & Entertainment (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Stochastic Lipschitz Q-Learning

Zhu, Xu, Dunson, David

arXiv.org Artificial IntelligenceApr-24-2019

In an episodic Markov Decision Process (MDP) problem, an online algorithm chooses from a set of actions in a sequence of $H$ trials, where $H$ is the episode length, in order to maximize the total payoff of the chosen actions. Q-learning, as the most popular model-free reinforcement learning (RL) algorithm, directly parameterizes and updates value functions without explicitly modeling the environment. Recently, [Jin et al. 2018] studies the sample complexity of Q-learning with finite states and actions. Their algorithm achieves nearly optimal regret, which shows that Q-learning can be made sample efficient. However, MDPs with large discrete states and actions [Silver et al. 2016] or continuous spaces [Mnih et al. 2013] cannot learn efficiently in this way. Hence, it is critical to develop new algorithms to solve this dilemma with provable guarantee on the sample complexity. With this motivation, we propose a novel algorithm that works for MDPs with a more general setting, which has infinitely many states and actions and assumes that the payoff function and transition kernel are Lipschitz continuous. We also provide corresponding theory justification for our algorithm. It achieves the regret $\tilde{\mathcal{O}}(K^{\frac{d+1}{d+2}}\sqrt{H^3}),$ where $K$ denotes the number of episodes and $d$ denotes the dimension of the joint space. To the best of our knowledge, this is the first analysis in the model-free setting whose established regret matches the lower bound up to a logarithmic factor.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

1904.10653

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback